Transformer for Emotion Recognition
This paper describes the UMONS solution for the OMG-Emotion Challenge. We
explore a context-dependent architecture where the arousal and valence of an
utterance are predicted according to its surrounding context (i.e. the
preceding and following utterances of the video). We report an improvement when
taking context into account, for both unimodal and multimodal predictions.
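As an illustration only (hypothetical layer sizes, not the authors' exact UMONS architecture), a context-dependent regressor can read the embeddings of the preceding, current and following utterances and predict arousal and valence for the middle one:

import torch
import torch.nn as nn

class ContextualAffectRegressor(nn.Module):
    def __init__(self, utt_dim=512, hidden=256):
        super().__init__()
        # A bidirectional GRU reads the preceding, current and following
        # utterance embeddings so the prediction sees its surrounding context.
        self.context_rnn = nn.GRU(utt_dim, hidden, batch_first=True,
                                  bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # arousal and valence

    def forward(self, utterances):
        # utterances: (batch, n_utterances, utt_dim), target utterance in the middle
        out, _ = self.context_rnn(utterances)
        centre = out[:, out.size(1) // 2]     # hidden state at the target utterance
        return self.head(centre)

# Example: a window of three utterance embeddings (previous, current, next)
model = ContextualAffectRegressor()
window = torch.randn(4, 3, 512)
arousal_valence = model(window)               # (4, 2)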
Modulating and attending the source image during encoding improves Multimodal Translation
We propose a new and fully end-to-end approach for multimodal translation
where the source text encoder modulates the entire visual input processing
using conditional batch normalization, in order to compute the most informative
image features for our task. Additionally, we propose a new attention mechanism
derived from this original idea, where the attention model for the visual input
is conditioned on the source text encoder representations. In the paper, we
detail our models as well as the image analysis pipeline. Finally, we report
experimental results. They are, as far as we know, the new state of the art on
three different test sets.
Comment: Accepted at NIPS Workshop
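A minimal sketch of the conditional batch normalization idea, assuming a generic PyTorch setup: the source-text encoder state predicts a per-channel scale and shift applied to the normalized image feature maps. Dimensions and module names are illustrative, not the paper's exact configuration.

import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_channels, text_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, feat_maps, text_state):
        # feat_maps: (B, C, H, W) CNN activations; text_state: (B, text_dim)
        normed = self.bn(feat_maps)
        gamma = self.to_gamma(text_state).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(text_state).unsqueeze(-1).unsqueeze(-1)
        # The source text modulates the entire visual input processing.
        return (1 + gamma) * normed + beta

feats = torch.randn(2, 256, 14, 14)   # assumed conv feature maps
text = torch.randn(2, 512)            # assumed text encoder state
out = ConditionalBatchNorm2d(256, 512)(feats, text)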
Object-oriented Targets for Visual Navigation using Rich Semantic Representations
When searching for an object, humans navigate through a scene using semantic
information and spatial relationships. We look for an object using our
knowledge of its attributes and relationships with other objects to infer the
probable location. In this paper, we propose to tackle the visual navigation
problem using rich semantic representations of the observed scene and
object-oriented targets to train an agent. We show that both allow the agent
to generalize to new targets and unseen scenes in a short amount of training
time.
Comment: Presented at NIPS workshop (ViGIL)
Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation
In state-of-the-art Neural Machine Translation, an attention mechanism is
used during decoding to enhance the translation. At every step, the decoder
uses this mechanism to focus on different parts of the source sentence to
gather the most useful information before outputting its target word. Recently,
the effectiveness of the attention mechanism has also been explored for
multimodal tasks, where it becomes possible to focus both on sentence parts and
image regions. Approaches to pool two modalities usually include element-wise
product, sum or concatenation. In this paper, we evaluate the more advanced
Multimodal Compact Bilinear pooling method, which takes the outer product of
two vectors to combine the attention features for the two modalities. This has
been previously investigated for visual question answering. We try out this
approach for multimodal image caption translation and show improvements
compared to basic combination methods.
Comment: Submitted to ICLR Workshop 201
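For readers unfamiliar with the mechanism, the sketch below approximates compact bilinear pooling with Count Sketch projections and an FFT-domain product; the output dimension and feature sizes are placeholders, not the values used in the paper.

import torch

def count_sketch(x, rand_h, rand_s, out_dim):
    # Project x (B, D) into out_dim dimensions with fixed random hashes and signs.
    sketch = x.new_zeros(x.size(0), out_dim)
    sketch.index_add_(1, rand_h, x * rand_s)
    return sketch

def mcb_pool(text_feat, img_feat, out_dim=8000, seed=0):
    g = torch.Generator().manual_seed(seed)
    h1 = torch.randint(0, out_dim, (text_feat.size(1),), generator=g)
    s1 = torch.randint(0, 2, (text_feat.size(1),), generator=g).float() * 2 - 1
    h2 = torch.randint(0, out_dim, (img_feat.size(1),), generator=g)
    s2 = torch.randint(0, 2, (img_feat.size(1),), generator=g).float() * 2 - 1
    sk_text = count_sketch(text_feat, h1, s1, out_dim)
    sk_img = count_sketch(img_feat, h2, s2, out_dim)
    # An element-wise product in the frequency domain is a circular convolution,
    # which approximates the outer-product (bilinear) interaction compactly.
    return torch.fft.irfft(torch.fft.rfft(sk_text) * torch.fft.rfft(sk_img),
                           n=out_dim)

fused = mcb_pool(torch.randn(4, 512), torch.randn(4, 2048))  # (4, 8000)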
An empirical study on the effectiveness of images in Multimodal Neural Machine Translation
In state-of-the-art Neural Machine Translation (NMT), an attention mechanism
is used during decoding to enhance the translation. At every step, the decoder
uses this mechanism to focus on different parts of the source sentence to
gather the most useful information before outputting its target word. Recently,
the effectiveness of the attention mechanism has also been explored for
multimodal tasks, where it becomes possible to focus both on sentence parts and
image regions that they describe. In this paper, we compare several attention
mechanisms on the multimodal translation task (English and image to German) and
evaluate the ability of the model to make use of images to improve translation.
We surpass state-of-the-art scores on the Multi30k data set; we nevertheless
identify and report several cases of model misbehavior while translating.
Comment: Accepted to EMNLP 201
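A generic soft-attention block of the kind compared in such models is sketched below: the decoder state attends either over source-word annotations or over image-region features. Dimensions are assumptions, not the specific variants evaluated in the paper.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, query_dim, ctx_dim, attn_dim=256):
        super().__init__()
        self.proj_q = nn.Linear(query_dim, attn_dim)
        self.proj_c = nn.Linear(ctx_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, context):
        # dec_state: (B, query_dim); context: (B, N, ctx_dim)
        # N = number of source words, or number of image regions (e.g. 14 x 14)
        e = self.score(torch.tanh(self.proj_q(dec_state).unsqueeze(1)
                                  + self.proj_c(context)))       # (B, N, 1)
        alpha = torch.softmax(e, dim=1)
        return (alpha * context).sum(dim=1), alpha.squeeze(-1)

# The same block can be instantiated once for the text and once for the image.
txt_ctx, _ = SoftAttention(512, 1024)(torch.randn(2, 512), torch.randn(2, 30, 1024))
img_ctx, _ = SoftAttention(512, 2048)(torch.randn(2, 512), torch.randn(2, 196, 2048))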
Bringing back simplicity and lightliness into neural image captioning
Neural Image Captioning (NIC) or neural caption generation has attracted a
lot of attention over the last few years. Describing an image in natural
language has been an emerging challenge in both computer vision and language
processing, so a lot of research has focused on driving this task forward with
new creative ideas. So far, the goal has been to maximize scores on automated
metrics and, to do so, one has to come up with a number of new modules and
techniques. Once these add up, the models become complex and resource-hungry.
In this paper, we take a small step backwards in order to study an architecture
with an interesting trade-off between performance and computational complexity.
To do so, we tackle every component of a neural captioning model and propose
one or more solutions that lighten the model overall. Our ideas are inspired by
two related tasks: Multimodal and Monomodal Neural Machine Translation.
Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation
In Multimodal Neural Machine Translation (MNMT), a neural model generates a
translated sentence that describes an image, given the image itself and one
source description in English. This is known as the multimodal image caption
translation task. The images are processed with a Convolutional Neural Network
(CNN) to extract visual features exploitable by the translation model. So far,
the CNNs used are pre-trained on object detection and localization tasks. We
hypothesize that richer architectures, such as dense captioning models, may be
more suitable for MNMT and could lead to improved translations. We extend this
intuition to the word embeddings, where we compute both linguistic and visual
representations for our corpus vocabulary. We combine and compare different
configurations.
Comment: Accepted to GLU 2017. arXiv admin note: text overlap with arXiv:1707.0099
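Below is one simple way to realize visually grounded word embeddings, sketched under assumptions (fixed per-word visual vectors, concatenation followed by a learned projection); the paper's actual configurations may combine the two representations differently.

import torch
import torch.nn as nn

class GroundedEmbedding(nn.Module):
    def __init__(self, vocab_size, txt_dim=300, vis_dim=2048, out_dim=300):
        super().__init__()
        self.txt = nn.Embedding(vocab_size, txt_dim)
        # One visual vector per word (e.g. pooled CNN features of images
        # associated with that word); kept fixed here for simplicity.
        self.register_buffer("vis", torch.randn(vocab_size, vis_dim))
        self.fuse = nn.Linear(txt_dim + vis_dim, out_dim)

    def forward(self, token_ids):
        combined = torch.cat([self.txt(token_ids), self.vis[token_ids]], dim=-1)
        return torch.tanh(self.fuse(combined))

emb = GroundedEmbedding(vocab_size=10000)
vectors = emb(torch.tensor([[1, 5, 42]]))  # (1, 3, 300)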
Adversarial reconstruction for Multi-modal Machine Translation
Even with the growing interest in problems at the intersection of Computer
Vision and Natural Language, grounding (i.e. identifying) the components of a
structured description in an image still remains a challenging task. This
contribution aims to propose a model which learns grounding by reconstructing
the visual features for the Multi-modal translation task. Previous works have
partially investigated standard approaches such as regression methods to
approximate the reconstruction of a visual input. In this paper, we propose a
different and novel approach which learns grounding by adversarial feedback. To
do so, we modulate our network following recent promising adversarial
architectures and evaluate how the adversarial feedback from visual
reconstruction, used as an auxiliary task, helps the model learn. We report
the highest scores in terms of BLEU and METEOR metrics on the different
datasets.
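A rough sketch of an adversarial reconstruction objective of this kind, with an assumed feature size and discriminator design: a small discriminator separates real CNN features from features reconstructed by the translation model, and the generator term is added to the usual translation loss.

import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(2048, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_reconstruction_losses(real_feat, fake_feat):
    # real_feat: CNN features of the source image, (B, 2048)
    # fake_feat: features reconstructed from the translation model, (B, 2048)
    real_logit = disc(real_feat)
    fake_logit = disc(fake_feat.detach())
    d_loss = bce(real_logit, torch.ones_like(real_logit)) + \
             bce(fake_logit, torch.zeros_like(fake_logit))
    # The translation model is rewarded when its reconstruction fools the
    # discriminator; this auxiliary term is added to the translation loss.
    g_loss = bce(disc(fake_feat), torch.ones_like(real_logit))
    return d_loss, g_loss

d_loss, g_loss = adversarial_reconstruction_losses(torch.randn(8, 2048),
                                                   torch.randn(8, 2048))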
Can adversarial training learn image captioning?
Recently, generative adversarial networks (GANs) have gathered a lot of
interest. Their efficiency in generating unseen samples of high quality,
especially images, has improved over the years. In the field of Natural
Language Generation (NLG), the use of the adversarial setting to generate
meaningful sentences has proven to be difficult for two reasons: the lack of
existing architectures to produce realistic sentences and the lack of
evaluation tools. In this paper, we propose an adversarial architecture related
to the conditional GAN (cGAN) that generates sentences according to a given
image (also called image captioning). This attempt is the first that uses no
pre-training or reinforcement methods. We also explain why our experimental
settings can be safely evaluated and interpreted for future work.
Comment: Accepted to NeurIPS 2019 ViGiL workshop
Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition
This paper aims to bring a new lightweight yet powerful solution for the task
of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two
architectures based on Transformers and modulation that combine the linguistic
and acoustic inputs from a wide range of datasets to challenge, and sometimes
surpass, the state-of-the-art in the field. To demonstrate the efficiency of
our models, we carefully evaluate their performance on the IEMOCAP, MOSI,
MOSEI and MELD datasets. The experiments can be directly replicated and the
code is fully open for future research.
Comment: EMNLP 2020 workshop: NLP Beyond Text (NLPBT)
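As a hedged illustration of modulated fusion (not the released code), a Transformer can encode the linguistic stream while a summary of the acoustic stream applies a learned scale and shift to its output before classification; sizes and the modulation point are assumptions.

import torch
import torch.nn as nn

class ModulatedFusion(nn.Module):
    def __init__(self, d_model=128, acoustic_dim=74, n_classes=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_scale = nn.Linear(acoustic_dim, d_model)
        self.to_shift = nn.Linear(acoustic_dim, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_emb, acoustic_feat):
        # text_emb: (B, T, d_model) token embeddings; acoustic_feat: (B, acoustic_dim)
        h = self.text_encoder(text_emb)
        scale = self.to_scale(acoustic_feat).unsqueeze(1)
        shift = self.to_shift(acoustic_feat).unsqueeze(1)
        h = (1 + scale) * h + shift          # the acoustic stream modulates the text
        return self.classifier(h.mean(dim=1))

logits = ModulatedFusion()(torch.randn(2, 20, 128), torch.randn(2, 74))  # (2, 4)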